Data

The German credit risk data can be downloaded from the UCI Machine Learning Repository. The dataset has 1000 observations and 21 variables, a mix of categorical and numeric.

Data Manipulation

The steps include converting the data to the required data types, replacing coded values with interpretable class labels, and checking for and omitting NAs (if any).
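The steps above can be sketched roughly as follows. This is a Python stand-in, not the code that produced this report. It assumes the space-separated layout and attribute codes described in the dataset's documentation (for example `A11` to `A14` for the checking account status, and `1`/`2` for good/bad risk). The two sample rows are illustrative, and `parse`, `CHECKING` and `RISK` are hypothetical names.

```python
import io

# Illustrative rows in the space-separated format of the raw data file
# (assumption: 20 predictor fields followed by the class code).
RAW = """\
A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1
A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2
"""

# Code-to-label maps; the label strings follow the tables in this document.
CHECKING = {"A11": "lt_0", "A12": "lt_200", "A13": "gte_200", "A14": "No_account"}
RISK = {"1": "Good", "2": "Bad"}


def parse(handle):
    """Convert raw lines to typed, labelled records, skipping incomplete rows."""
    rows = []
    for line in handle:
        fields = line.split()
        # Omit rows that are incomplete or contain a missing-value marker.
        if len(fields) != 21 or any(f in ("NA", "?") for f in fields):
            continue
        rows.append({
            "Status_checking_account": CHECKING[fields[0]],  # categorical
            "Duration_in_month": int(fields[1]),             # numeric
            "Credit_amount": int(fields[4]),                 # numeric
            "Age": int(fields[12]),                          # numeric
            "Credit_Risk": RISK[fields[20]],                 # outcome label
        })
    return rows


rows = parse(io.StringIO(RAW))
print(rows[0])
```

Only a handful of columns are converted here to keep the sketch short; the remaining categorical columns would need their own code-to-label maps.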

Let’s look at the data! It has 1000 rows and 21 variables.

Variables in the data
Status_checking_account
Duration_in_month
Credit_history
Purpose
Credit_amount
Savings_account_bonds
Present_employment_since
Installment_rate_in_percentage_of_disp_income
Personal_status_and_sex
Guarantors
Present_residence_since
Property
Age
Other_installment_plans
Housing
Number_of_existing_credits_at_this_bank
Job
Number_of_dependants
Telephone
foreign_worker
Credit_Risk

Data Summary

Summary statistics for the numeric variables:

Duration_in_month    Credit_amount     Age
Min.   : 4.0         Min.   :  250     Min.   :19.00
1st Qu.:12.0         1st Qu.: 1366     1st Qu.:27.00
Median :18.0         Median : 2320     Median :33.00
Mean   :20.9         Mean   : 3271     Mean   :35.55
3rd Qu.:24.0         3rd Qu.: 3972     3rd Qu.:42.00
Max.   :72.0         Max.   :18424     Max.   :75.00
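For readers who want to reproduce this kind of six-number summary, here is a minimal Python sketch. It is not the code used for the output above. The `summarize` helper is hypothetical, and the input values are illustrative rather than taken from the credit data. The `"inclusive"` quantile method interpolates linearly between order statistics, which should correspond to the default quantile definition in R.

```python
import statistics


def summarize(values):
    """Return minimum, quartiles, mean and maximum of a numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Illustrative values only, not rows from the credit data.
ages = [19, 27, 33, 42, 75]
for name, value in summarize(ages).items():
    print(f"{name:8s}: {value}")
```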

Frequency Tables

Credit risk is the outcome variable. The tables below cross-tabulate each variable against Credit_Risk. Each cell shows the count and, beneath it, the proportion of its row total. For example, for the foreign_worker variable, 30.7% of foreign workers are labelled “bad” and 69.3% are labelled “good”. The final table cross-tabulates Credit_Risk with itself, so it only shows the overall class balance: 300 “bad” and 700 “good” observations.
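The row proportions can be checked by hand. The Python sketch below is an illustration, not the code that produced the tables. It rebuilds the foreign_worker observations from the counts reported in that variable's table and recomputes each cell's share of its row; `row_proportions` is a hypothetical helper name.

```python
from collections import Counter


def row_proportions(pairs):
    """Cross-tabulate (category, outcome) pairs into (count, row share) cells."""
    counts = Counter(pairs)
    row_totals = Counter()
    for (category, _outcome), n in counts.items():
        row_totals[category] += n
    return {
        key: (n, round(n / row_totals[key[0]], 3))
        for key, n in counts.items()
    }


# Observations rebuilt from the counts in the foreign_worker table.
pairs = (
    [("No", "Bad")] * 4 + [("No", "Good")] * 33
    + [("Yes", "Bad")] * 296 + [("Yes", "Good")] * 667
)
table = row_proportions(pairs)
print(table[("Yes", "Bad")])   # (296, 0.307)
print(table[("Yes", "Good")])  # (667, 0.693)
```

Dividing by the row total rather than the grand total is what lets the two cells in each row be read as the bad/good split within that category.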

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                         | Credit_Risk 
## Status_checking_account |       Bad |      Good | Row Total | 
## ------------------------|-----------|-----------|-----------|
##              No_account |        46 |       348 |       394 | 
##                         |     0.117 |     0.883 |     0.394 | 
## ------------------------|-----------|-----------|-----------|
##                    lt_0 |       135 |       139 |       274 | 
##                         |     0.493 |     0.507 |     0.274 | 
## ------------------------|-----------|-----------|-----------|
##                  lt_200 |       105 |       164 |       269 | 
##                         |     0.390 |     0.610 |     0.269 | 
## ------------------------|-----------|-----------|-----------|
##                 gte_200 |        14 |        49 |        63 | 
##                         |     0.222 |     0.778 |     0.063 | 
## ------------------------|-----------|-----------|-----------|
##            Column Total |       300 |       700 |      1000 | 
## ------------------------|-----------|-----------|-----------|
## 
##                        | Credit_Risk 
##         Credit_history |       Bad |      Good | Row Total | 
## -----------------------|-----------|-----------|-----------|
##               Critical |        50 |       243 |       293 | 
##                        |     0.171 |     0.829 |     0.293 | 
## -----------------------|-----------|-----------|-----------|
##        delayed_in_past |        28 |        60 |        88 | 
##                        |     0.318 |     0.682 |     0.088 | 
## -----------------------|-----------|-----------|-----------|
##          No_credit_due |        25 |        15 |        40 | 
##                        |     0.625 |     0.375 |     0.040 | 
## -----------------------|-----------|-----------|-----------|
##          All_paid_duly |        28 |        21 |        49 | 
##                        |     0.571 |     0.429 |     0.049 | 
## -----------------------|-----------|-----------|-----------|
## All_existing_paid_duly |       169 |       361 |       530 | 
##                        |     0.319 |     0.681 |     0.530 | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |       300 |       700 |      1000 | 
## -----------------------|-----------|-----------|-----------|
## 
##              | Credit_Risk 
##      Purpose |       Bad |      Good | Row Total | 
## -------------|-----------|-----------|-----------|
##   Appliances |         4 |         8 |        12 | 
##              |     0.333 |     0.667 |     0.012 | 
## -------------|-----------|-----------|-----------|
##     Business |        34 |        63 |        97 | 
##              |     0.351 |     0.649 |     0.097 | 
## -------------|-----------|-----------|-----------|
##    Education |        22 |        28 |        50 | 
##              |     0.440 |     0.560 |     0.050 | 
## -------------|-----------|-----------|-----------|
##    Furniture |        58 |       123 |       181 | 
##              |     0.320 |     0.680 |     0.181 | 
## -------------|-----------|-----------|-----------|
##      New.car |        89 |       145 |       234 | 
##              |     0.380 |     0.620 |     0.234 | 
## -------------|-----------|-----------|-----------|
##       Others |         5 |         7 |        12 | 
##              |     0.417 |     0.583 |     0.012 | 
## -------------|-----------|-----------|-----------|
##      Repairs |         8 |        14 |        22 | 
##              |     0.364 |     0.636 |     0.022 | 
## -------------|-----------|-----------|-----------|
##   Retraining |         1 |         8 |         9 | 
##              |     0.111 |     0.889 |     0.009 | 
## -------------|-----------|-----------|-----------|
##   Television |        62 |       218 |       280 | 
##              |     0.221 |     0.779 |     0.280 | 
## -------------|-----------|-----------|-----------|
##     Used.car |        17 |        86 |       103 | 
##              |     0.165 |     0.835 |     0.103 | 
## -------------|-----------|-----------|-----------|
## Column Total |       300 |       700 |      1000 | 
## -------------|-----------|-----------|-----------|
## 
##                       | Credit_Risk 
## Savings_account_bonds |       Bad |      Good | Row Total | 
## ----------------------|-----------|-----------|-----------|
##            No_savings |        32 |       151 |       183 | 
##                       |     0.175 |     0.825 |     0.183 | 
## ----------------------|-----------|-----------|-----------|
##                lt_100 |       217 |       386 |       603 | 
##                       |     0.360 |     0.640 |     0.603 | 
## ----------------------|-----------|-----------|-----------|
##               100_500 |        34 |        69 |       103 | 
##                       |     0.330 |     0.670 |     0.103 | 
## ----------------------|-----------|-----------|-----------|
##              500_1000 |        11 |        52 |        63 | 
##                       |     0.175 |     0.825 |     0.063 | 
## ----------------------|-----------|-----------|-----------|
##               gt_1000 |         6 |        42 |        48 | 
##                       |     0.125 |     0.875 |     0.048 | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |       300 |       700 |      1000 | 
## ----------------------|-----------|-----------|-----------|
## 
##                          | Credit_Risk 
## Present_employment_since |       Bad |      Good | Row Total | 
## -------------------------|-----------|-----------|-----------|
##               Unemployed |        23 |        39 |        62 | 
##                          |     0.371 |     0.629 |     0.062 | 
## -------------------------|-----------|-----------|-----------|
##                     1_yr |        70 |       102 |       172 | 
##                          |     0.407 |     0.593 |     0.172 | 
## -------------------------|-----------|-----------|-----------|
##                     4_yr |       104 |       235 |       339 | 
##                          |     0.307 |     0.693 |     0.339 | 
## -------------------------|-----------|-----------|-----------|
##                     7_yr |        39 |       135 |       174 | 
##                          |     0.224 |     0.776 |     0.174 | 
## -------------------------|-----------|-----------|-----------|
##                  gt_7_yr |        64 |       189 |       253 | 
##                          |     0.253 |     0.747 |     0.253 | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |       300 |       700 |      1000 | 
## -------------------------|-----------|-----------|-----------|
## 
##                                               | Credit_Risk 
## Installment_rate_in_percentage_of_disp_income |       Bad |      Good | Row Total | 
## ----------------------------------------------|-----------|-----------|-----------|
##                                          0_20 |       159 |       317 |       476 | 
##                                               |     0.334 |     0.666 |     0.476 | 
## ----------------------------------------------|-----------|-----------|-----------|
##                                         20_25 |        45 |       112 |       157 | 
##                                               |     0.287 |     0.713 |     0.157 | 
## ----------------------------------------------|-----------|-----------|-----------|
##                                         25_35 |        62 |       169 |       231 | 
##                                               |     0.268 |     0.732 |     0.231 | 
## ----------------------------------------------|-----------|-----------|-----------|
##                                       35_plus |        34 |       102 |       136 | 
##                                               |     0.250 |     0.750 |     0.136 | 
## ----------------------------------------------|-----------|-----------|-----------|
##                                  Column Total |       300 |       700 |      1000 | 
## ----------------------------------------------|-----------|-----------|-----------|
## 
##                         | Credit_Risk 
## Personal_status_and_sex |       Bad |      Good | Row Total | 
## ------------------------|-----------|-----------|-----------|
##           Male.divorced |        20 |        30 |        50 | 
##                         |     0.400 |     0.600 |     0.050 | 
## ------------------------|-----------|-----------|-----------|
##         Female.divorced |       109 |       201 |       310 | 
##                         |     0.352 |     0.648 |     0.310 | 
## ------------------------|-----------|-----------|-----------|
##             male.single |       146 |       402 |       548 | 
##                         |     0.266 |     0.734 |     0.548 | 
## ------------------------|-----------|-----------|-----------|
##            male.married |        25 |        67 |        92 | 
##                         |     0.272 |     0.728 |     0.092 | 
## ------------------------|-----------|-----------|-----------|
##            Column Total |       300 |       700 |      1000 | 
## ------------------------|-----------|-----------|-----------|
## 
##              | Credit_Risk 
##   Guarantors |       Bad |      Good | Row Total | 
## -------------|-----------|-----------|-----------|
##         none |       272 |       635 |       907 | 
##              |     0.300 |     0.700 |     0.907 | 
## -------------|-----------|-----------|-----------|
## co_applicant |        18 |        23 |        41 | 
##              |     0.439 |     0.561 |     0.041 | 
## -------------|-----------|-----------|-----------|
##    guarantor |        10 |        42 |        52 | 
##              |     0.192 |     0.808 |     0.052 | 
## -------------|-----------|-----------|-----------|
## Column Total |       300 |       700 |      1000 | 
## -------------|-----------|-----------|-----------|
## 
##                         | Credit_Risk 
## Present_residence_since |       Bad |      Good | Row Total | 
## ------------------------|-----------|-----------|-----------|
##                 lt_1_yr |        36 |        94 |       130 | 
##                         |     0.277 |     0.723 |     0.130 | 
## ------------------------|-----------|-----------|-----------|
##                   1_4yr |        97 |       211 |       308 | 
##                         |     0.315 |     0.685 |     0.308 | 
## ------------------------|-----------|-----------|-----------|
##                   4_7yr |        43 |       106 |       149 | 
##                         |     0.289 |     0.711 |     0.149 | 
## ------------------------|-----------|-----------|-----------|
##                 gt_7_yr |       124 |       289 |       413 | 
##                         |     0.300 |     0.700 |     0.413 | 
## ------------------------|-----------|-----------|-----------|
##            Column Total |       300 |       700 |      1000 | 
## ------------------------|-----------|-----------|-----------|
## 
##              | Credit_Risk 
##     Property |       Bad |      Good | Row Total | 
## -------------|-----------|-----------|-----------|
##  No.property |        67 |        87 |       154 | 
##              |     0.435 |     0.565 |     0.154 | 
## -------------|-----------|-----------|-----------|
##  Real.estate |        60 |       222 |       282 | 
##              |     0.213 |     0.787 |     0.282 | 
## -------------|-----------|-----------|-----------|
##    insurance |        71 |       161 |       232 | 
##              |     0.306 |     0.694 |     0.232 | 
## -------------|-----------|-----------|-----------|
##          car |       102 |       230 |       332 | 
##              |     0.307 |     0.693 |     0.332 | 
## -------------|-----------|-----------|-----------|
## Column Total |       300 |       700 |      1000 | 
## -------------|-----------|-----------|-----------|
## 
##                         | Credit_Risk 
## Other_installment_plans |       Bad |      Good | Row Total | 
## ------------------------|-----------|-----------|-----------|
##                    None |       224 |       590 |       814 | 
##                         |     0.275 |     0.725 |     0.814 | 
## ------------------------|-----------|-----------|-----------|
##                   banks |        57 |        82 |       139 | 
##                         |     0.410 |     0.590 |     0.139 | 
## ------------------------|-----------|-----------|-----------|
##                  stores |        19 |        28 |        47 | 
##                         |     0.404 |     0.596 |     0.047 | 
## ------------------------|-----------|-----------|-----------|
##            Column Total |       300 |       700 |      1000 | 
## ------------------------|-----------|-----------|-----------|
## 
##              | Credit_Risk 
##      Housing |       Bad |      Good | Row Total | 
## -------------|-----------|-----------|-----------|
##         Free |        44 |        64 |       108 | 
##              |     0.407 |     0.593 |     0.108 | 
## -------------|-----------|-----------|-----------|
##         Rent |        70 |       109 |       179 | 
##              |     0.391 |     0.609 |     0.179 | 
## -------------|-----------|-----------|-----------|
##          Own |       186 |       527 |       713 | 
##              |     0.261 |     0.739 |     0.713 | 
## -------------|-----------|-----------|-----------|
## Column Total |       300 |       700 |      1000 | 
## -------------|-----------|-----------|-----------|
## 
##                                         | Credit_Risk 
## Number_of_existing_credits_at_this_bank |       Bad |      Good | Row Total | 
## ----------------------------------------|-----------|-----------|-----------|
##                                       1 |       200 |       433 |       633 | 
##                                         |     0.316 |     0.684 |     0.633 | 
## ----------------------------------------|-----------|-----------|-----------|
##                                       2 |        92 |       241 |       333 | 
##                                         |     0.276 |     0.724 |     0.333 | 
## ----------------------------------------|-----------|-----------|-----------|
##                                       3 |         6 |        22 |        28 | 
##                                         |     0.214 |     0.786 |     0.028 | 
## ----------------------------------------|-----------|-----------|-----------|
##                                       4 |         2 |         4 |         6 | 
##                                         |     0.333 |     0.667 |     0.006 | 
## ----------------------------------------|-----------|-----------|-----------|
##                            Column Total |       300 |       700 |      1000 | 
## ----------------------------------------|-----------|-----------|-----------|
## 
##                   | Credit_Risk 
##               Job |       Bad |      Good | Row Total | 
## ------------------|-----------|-----------|-----------|
## Unemployed_NonRes |         7 |        15 |        22 | 
##                   |     0.318 |     0.682 |     0.022 | 
## ------------------|-----------|-----------|-----------|
##     Unskilled_Res |        56 |       144 |       200 | 
##                   |     0.280 |     0.720 |     0.200 | 
## ------------------|-----------|-----------|-----------|
##           skilled |       186 |       444 |       630 | 
##                   |     0.295 |     0.705 |     0.630 | 
## ------------------|-----------|-----------|-----------|
##        management |        51 |        97 |       148 | 
##                   |     0.345 |     0.655 |     0.148 | 
## ------------------|-----------|-----------|-----------|
##      Column Total |       300 |       700 |      1000 | 
## ------------------|-----------|-----------|-----------|
## 
##                      | Credit_Risk 
## Number_of_dependants |       Bad |      Good | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                 lt_2 |        46 |       109 |       155 | 
##                      |     0.297 |     0.703 |     0.155 | 
## ---------------------|-----------|-----------|-----------|
##                 gt_2 |       254 |       591 |       845 | 
##                      |     0.301 |     0.699 |     0.845 | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |       300 |       700 |      1000 | 
## ---------------------|-----------|-----------|-----------|
## 
##              | Credit_Risk 
##    Telephone |       Bad |      Good | Row Total | 
## -------------|-----------|-----------|-----------|
##           No |       187 |       409 |       596 | 
##              |     0.314 |     0.686 |     0.596 | 
## -------------|-----------|-----------|-----------|
##          Yes |       113 |       291 |       404 | 
##              |     0.280 |     0.720 |     0.404 | 
## -------------|-----------|-----------|-----------|
## Column Total |       300 |       700 |      1000 | 
## -------------|-----------|-----------|-----------|
## 
##                | Credit_Risk 
## foreign_worker |       Bad |      Good | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |         4 |        33 |        37 | 
##                |     0.108 |     0.892 |     0.037 | 
## ---------------|-----------|-----------|-----------|
##            Yes |       296 |       667 |       963 | 
##                |     0.307 |     0.693 |     0.963 | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       300 |       700 |      1000 | 
## ---------------|-----------|-----------|-----------|
## 
##              | Credit_Risk 
##  Credit_Risk |       Bad |      Good | Row Total | 
## -------------|-----------|-----------|-----------|
##          Bad |       300 |         0 |       300 | 
##              |     1.000 |     0.000 |     0.300 | 
## -------------|-----------|-----------|-----------|
##         Good |         0 |       700 |       700 | 
##              |     0.000 |     1.000 |     0.700 | 
## -------------|-----------|-----------|-----------|
## Column Total |       300 |       700 |      1000 | 
## -------------|-----------|-----------|-----------|
## 
## 

Measures of Association

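• Chi-sq test of independence: tests whether two categorical variables are independent, that is, whether there is a statistically significant association between their categories. Here each variable is tested against Credit_Risk. A p-value below 0.05 (the significance threshold) indicates that the variable is significantly associated with credit risk. P-values shown as 0.0000000 are rounded to seven decimal places, not exactly zero, and the Credit_Risk row tests the outcome against itself, so it is trivially significant.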

Variable                                         p.values
Status_checking_account                          0.0000000
Credit_history                                   0.0000000
Purpose                                          0.0001157
Savings_account_bonds                            0.0000003
Present_employment_since                         0.0010455
Installment_rate_in_percentage_of_disp_income    0.1400333
Personal_status_and_sex                          0.0222380
Guarantors                                       0.0360560
Present_residence_since                          0.8615521
Property                                         0.0000286
Other_installment_plans                          0.0016293
Housing                                          0.0001117
Number_of_existing_credits_at_this_bank          0.4451441
Job                                              0.5965816
Number_of_dependants                             0.9240463
Telephone                                        0.2488438
foreign_worker                                   0.0094431
Credit_Risk                                      0.0000000
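To make the test concrete, the following Python sketch recomputes the p-value for two of the variables from the counts in their frequency tables. It is not the code used for the table above. It assumes Pearson's chi-square statistic without a continuity correction, an assumption that appears consistent with the reported values. `chi2_sf` and `chi2_independence` are hypothetical helper names; the tail probability uses the closed forms for 1 and 2 degrees of freedom plus a standard recurrence, so only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(x; v + 2) = Q(x; v) + (x/2)^(v/2) * exp(-x/2) / Gamma(v/2 + 1).
    if df % 2:
        q, v = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, v = math.exp(-x / 2), 2
    while v < df:
        q += (x / 2) ** (v / 2) * math.exp(-x / 2) / math.gamma(v / 2 + 1)
        v += 2
    return q


def chi2_independence(table):
    """Pearson chi-square test (no continuity correction) on a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Sum (observed - expected)^2 / expected over all cells, where
    # expected = row total * column total / grand total.
    stat = sum(
        (obs - r * c / total) ** 2 / (r * c / total)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Counts copied from the Telephone and Housing tables (columns: Bad, Good).
telephone = [[187, 409], [113, 291]]
housing = [[44, 64], [70, 109], [186, 527]]
for name, tab in [("Telephone", telephone), ("Housing", housing)]:
    stat, df, p = chi2_independence(tab)
    print(f"{name}: chi2={stat:.3f}, df={df}, p={p:.7f}")
```

Under these assumptions the two p-values should come out close to the 0.2488438 and 0.0001117 reported above, which is why Telephone is read as not significantly associated with credit risk while Housing is.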

Visualizations

Let’s explore the data through plots.

• Barplots: for categorical data, showing frequencies color-coded by the outcome variable (Credit risk).

• Boxplots: for numeric data, showing distributions color-coded by the outcome variable (Credit risk).
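The bars in a grouped barplot are the same per-category counts shown in the frequency tables, so the less obvious part is what a boxplot draws for each outcome class. The Python sketch below computes those numbers (quartiles, whisker ends at 1.5 times the interquartile range, and outliers) per class. It is an illustration on made-up values, not the plotting code used here, and plotting libraries differ slightly in how they compute quartiles; `boxplot_stats` is a hypothetical helper.

```python
import statistics
from collections import defaultdict


def boxplot_stats(values):
    """Quartiles, 1.5 * IQR whisker ends and outliers for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low <= v <= high]
    return {
        "lower_whisker": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": max(inside),
        "outliers": sorted(v for v in values if v < low or v > high),
    }


# Illustrative (Credit_Risk, Duration_in_month) pairs, not real rows.
records = [
    ("Good", 6), ("Good", 12), ("Good", 18), ("Good", 24), ("Good", 60),
    ("Bad", 12), ("Bad", 24), ("Bad", 36), ("Bad", 48), ("Bad", 60),
]

# Split the numeric values by outcome class, one box per class.
groups = defaultdict(list)
for risk, duration in records:
    groups[risk].append(duration)

for risk, durations in groups.items():
    print(risk, boxplot_stats(durations))
```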